This article analyses the monthly transaction data set to better understand customer transaction patterns. It also offers a study of the linear regression model, an important concept in machine learning, and discusses how this model can assist decision-making when identifying trends in bank transactions between 2013 and 2016.
To capture this information well, the CRISP-DM process model is adopted to provide a structured planning approach to a data mining project, with six high-level phases. These phases help companies comprehend the data mining process and serve as a road map for planning and executing a data mining project (Medeiros, 2021). This study explores each of the six phases, and the tasks associated with each, in the following order:
Cross-Industry Standard Process for Data Mining (CRISP-DM project, 2000)
Business Understanding is the first step in the CRISP-DM methodology. The main task in this stage is to understand the purpose of the analysis and to provide a clear and crisp definition of the problem in terms of both the business objectives and the data mining objectives.
In our case study, the business question, paraphrased from the sales manager's request, is: "what is driving the trends and increasing total monthly revenue?". The corresponding data mining objective is to apply data visualization tools to identify underlying patterns in the dataset.
Following that, the Data Understanding, or Exploratory Data Analysis (EDA), phase focuses on understanding the data collected to support the Business Understanding and resolve the business challenge (Wijaya, 2021). Visualization techniques play an essential role here, so the data was imported into the software package R to construct visualizations representing the findings of the analysis.
Additionally, a two-stage approach is adopted to specify the content in this section, with Stage 1 devoted to basic exploration and Stage 2 devoted to univariate, bivariate, and multivariate analysis.
Once the data is loaded, we explore it with the goal of understanding its dimensions, data types, and distribution of values. In this assignment, a time-series data set of financial transactions was used as the major source of data. The attributes are described in Appendix []. As the table below shows, the data records 470,000+ observations across 5 columns, equivalent to 94,000+ bank transactions.
# load the required packages
library(readr)         # read_csv()
library(here)          # here()
library(dplyr)         # %>% pipe
library(DataExplorer)  # introduce(), plot_intro(), plot_correlation()
library(kableExtra)    # kbl(), kable_styling()
# read the transaction data into a data frame
df <- read_csv(here("dataset/transactions.csv"))
# quick summary: dimensions, column data types, missing values and memory usage
df_overview <- introduce(df) %>%
t() # transpose the summary for a better format
df_overview %>% # turn the output into a table
kbl() %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE) # apply bootstrap theme to the table
| Metric | Value |
|---|---|
| rows | 94248 |
| columns | 5 |
| discrete_columns | 2 |
| continuous_columns | 3 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 94248 |
| total_observations | 471240 |
| memory_usage | 4203448 |
Alternatively, the code below produces a graphical illustration of the information in the table above.
# Plot the Quick summary information of the data set
plot_intro(df)
Missing values plot
The plot shows that there are no missing values in any field of the data. Nevertheless, some data sets encode missing observations in categorical/character columns as a separate category such as "NA" or "NULL", to name a few. There is therefore a chance that such observations go undetected, which would badly distort the real data distribution. Consequently, the missing values of our categorical columns need a further check in order to confirm this observation.
The output below shows that no such missing-value category exists in the categorical columns. Thus, we can confirm the hypothesis that there are no missing values in either the numerical or the categorical columns of this data set.
# convert character columns to upper case for a more reliable check
missing_df <- data.frame(lapply(df, function(v) {
if (is.character(v)) toupper(v) else v
}))
# check whether missing values hide under a category such as "NA", "N/A", "NULL" or ""
# date column
sprintf("Are there any missing-value categories in the date column (T/F)?: %s",
any(missing_df$date %in% c("NA", "N/A", "NULL", "")))
## [1] "Are there any missing-value categories in the date column (T/F)?: FALSE"
# customer_id column
sprintf("Are there any missing-value categories in the customer_id column (T/F)?: %s",
any(missing_df$customer_id %in% c("NA", "N/A", "NULL", "")))
## [1] "Are there any missing-value categories in the customer_id column (T/F)?: FALSE"
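Rather than repeating this test per column, the same check can be applied to every character column at once. A minimal sketch (it assumes `df` has been loaded as above; the set of placeholder strings is an assumption and can be extended):

```r
# check every character column at once for hidden missing-value categories
placeholder_values <- c("NA", "N/A", "NULL", "")
char_cols <- names(df)[sapply(df, is.character)]
sapply(char_cols, function(col) {
  any(toupper(df[[col]]) %in% placeholder_values)
})
```

The result is a named logical vector, so any column that returns TRUE can be investigated immediately.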
The 5 features contained in this data set (date, customer_id, industry, location, monthly_amount) clearly indicate the total transaction amount for each customer per month, spanning a 3-year period over a range of industries and locations. Therefore, no further adjustment of the column names is needed.
It is also worth noting that these features come in multiple formats, including both numerical and time-series data. However, the output shows that the date column has the wrong data type and will need to be converted to date format later.
Additionally, we investigate the response field further. Recalling the business question, we expect to use the monthly_amount column as the target field, since our goal is to predict the monthly transaction value for next month. Because the observations in this column are continuous, the problem is defined as a supervised regression problem. Knowing this is essential for selecting the right machine learning model in the later stage of this report.
# inspect columns data type
sapply(df, class)
## date customer_id industry location monthly_amount
## "character" "character" "numeric" "numeric" "numeric"
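As a preview of what "supervised regression" means here, the sketch below fits a simple linear trend to the total monthly amounts and extrapolates one month ahead. This is purely illustrative, not the model built later in the report, and it assumes the date column has already been converted to Date format (a conversion the report performs in a later step):

```r
# illustrative sketch only: a simple linear trend of total monthly amounts,
# previewing the supervised regression approach discussed later
monthly_totals <- aggregate(monthly_amount ~ date, data = df, FUN = sum)
monthly_totals <- monthly_totals[order(monthly_totals$date), ]
monthly_totals$t <- seq_len(nrow(monthly_totals))     # numeric month index
fit <- lm(monthly_amount ~ t, data = monthly_totals)  # fit the linear model
predict(fit, newdata = data.frame(t = nrow(monthly_totals) + 1))  # next month
```

Using a numeric month index as the sole predictor keeps the example minimal; the actual model can, and later does, draw on the other features as well.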
Next, we examine the shape of the data. According to Sharma (2020), linear models assume that the distributions of the independent variables and the target variable are similar. Understanding the skewness of the data therefore helps us create better linear models.
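Skewness of the target can be quantified directly rather than judged from a histogram alone. A minimal sketch using the standard sample-skewness formula in base R (the helper name `skewness` is our own; dedicated packages provide equivalent functions):

```r
# sample skewness of the target variable (manual formula, base R only)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
skewness(df$monthly_amount)  # a value well above 0 indicates right skew
```

A strongly skewed target is often a cue to consider a transformation (for example, a log) before fitting a linear model.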
The histograms below show which industry and location groups account for the bulk of the transactions. As can be seen, industries 1 and 2 have the highest frequencies in the Industry histogram, while locations 1 and 2 have the highest frequencies in the Location histogram.
These results imply that the model may perform better at predicting the total transaction amount for next month for locations 1 and 2 and/or industries 1 and 2, where the data is most plentiful.
## plot the frequency distribution of transactions by INDUSTRY
par(mfrow=c(1,2)) # combine 2 plots into 1 plot
hist(df$industry,
main = "Trans by Industry",
xlab="Industry",
xlim = c(0,10),
ylim=c(0,50000),
las=0)
## plot the frequency distribution of transactions by LOCATION
hist(df$location,
main = "Trans by Location",
xlab="Location",
xlim = c(0,10),
ylim=c(0,50000),
las=0)
Data distribution
## View overall correlation heatmap
plot_correlation(na.omit(df), type = "continuous")
This information is essential for gaining a better understanding of the transaction data set and provides useful guidance for transforming the data in the later stage.
After conducting an inspection of the retail transaction data, the following issues were identified: the date column is stored as text rather than as dates, the customer_id, industry and location columns hold categorical codes rather than quantities, and the monthly_amount values are easier to read in millions. The conversions below address these points.
# convert the date column into the Date format
df$date <- as.Date(df$date, "%d/%m/%Y")
# convert the customer_id column into character format
df$customer_id <- as.character(df$customer_id)
# convert the location column into character format
df$location <- as.character(df$location)
# convert the industry column into character format
df$industry <- as.character(df$industry)
# rescale the transaction amount to millions for display purposes
df$monthly_amount <- df$monthly_amount / 1e6
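A quick defensive check after the conversions guards against a silently failed parse, for example `as.Date()` returning NA when the format string does not match the data. A minimal sketch that simply asserts the types produced above:

```r
# defensive check: fail fast if any conversion above silently failed
stopifnot(
  inherits(df$date, "Date"),      # parsed dates, not character strings
  is.character(df$customer_id),
  is.character(df$location),
  is.character(df$industry),
  is.numeric(df$monthly_amount)
)
```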
#######################################################
# TABLE TRANSFORMATION
#######################################################
# create new df contain total transaction amount and number of transaction over time
time_series_df <- sqldf(
"SELECT
date,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY date -- aggregate per month
ORDER BY date
"
)
# create new df contain total transaction by industry over time
industry <- sqldf(
"SELECT
date,
industry,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY
date, -- aggregate per month
industry -- and per industry
ORDER BY date
"
)
# create new df contain total transaction by location over time
location <- sqldf(
"SELECT
date,
location,
SUM(monthly_amount) AS transaction_amount, -- sum total transaction amount
COUNT(*) as transaction_count -- count total number of transactions
FROM df
GROUP BY
date, -- aggregate per month
location -- and per location
ORDER BY date
"
)
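The same monthly aggregation can be expressed with dplyr instead of SQL; both produce identical tables, so the choice is a matter of taste. A sketch of the dplyr equivalent of the first sqldf call (the name `time_series_df2` is ours, used to avoid clobbering the SQL result):

```r
# the same monthly aggregation expressed with dplyr (equivalent to sqldf above)
library(dplyr)

time_series_df2 <- df %>%
  group_by(date) %>%
  summarise(
    transaction_amount = sum(monthly_amount),  # total amount per month
    transaction_count  = n()                   # number of transactions per month
  ) %>%
  arrange(date)
```

Staying inside dplyr avoids the overhead of sqldf, which copies the data frame into a temporary SQLite database for each query.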
#######################################################
# DATA VISUALIZATION
#######################################################
# plot transaction amount over time
monthly_amount_plot <- ggplot(time_series_df) +
aes(x = date, y = transaction_amount) +
geom_line(colour = "#B22222") +
geom_point(size = 2, show.legend = FALSE) +
geom_text(aes(date, transaction_amount, label = sprintf("%0.2f", round(transaction_amount, digits = 2)), hjust=-0.1)) +
labs(x = "Year", y = "Total transaction amount in millions") +
theme_minimal() +
transition_reveal(date) # gganimate specific bits
a_gif <- animate(monthly_amount_plot, # animate the plot
fps = 25,
duration = 25,
width = 400, height = 400,
renderer = gifski_renderer("img/animation.gif"))
# plot number of transaction count over time
transaction_count_plot <- ggplot(time_series_df) +
aes(x = date, y = transaction_count) +
geom_line(size = 0.7, colour = "#B22222") +
geom_point() +
geom_text(aes(date, transaction_count, label = transaction_count), hjust = -0.1) +
labs(x = "Year", y = "Number of transactions") +
theme_minimal() +
transition_reveal(date) # gganimate specific bits
b_gif <- animate(transaction_count_plot, # animate the plot
fps = 25,
duration = 25,
width = 400, height = 400,
renderer = gifski_renderer("img/animation2.gif"))
# Read individual images from each GIF
a_mgif <- image_read(a_gif)
b_mgif <- image_read(b_gif)
# join all plots together into 1 single page
new_gif <- image_append(c(a_mgif[1], b_mgif[1]))
for(i in 2:length(a_mgif)){
combined <- image_append(c(a_mgif[i], b_mgif[i]))
new_gif <- c(new_gif, combined)
}
new_gif # display final plot
Data distribution
# plot transaction info by industry
industry_amount_plot <- ggplot(industry) +
aes(x = date, y = transaction_amount, colour = industry) +
geom_line(size = 0.5) +
scale_color_hue(direction = 1) +
theme_minimal()
# plot transaction info by location
location_amount_plot <- ggplot(location) +
aes(x = date, y = transaction_amount, colour = location) +
geom_line(size = 0.5) +
scale_color_hue(direction = 1) +
theme_minimal()
ggarrange(industry_amount_plot, location_amount_plot,
labels = c("monthly amount by industry", "monthly amount by location"),
ncol = 2, nrow = 1)
In the correlation heatmap shown earlier, dark squares represent a strong correlation (close to 1) while lighter ones represent a weaker correlation (close to 0). That is why all the diagonal entries are dark blue: a variable is fully correlated with itself. The off-diagonal cells are the ones worth watching. When two or more input variables are strongly linearly related, a situation called collinearity (multicollinearity when more than two variables are involved), they carry overlapping information. Collinearity can make a simple regression on one variable appear significant even when a multiple regression, which accounts for the other correlated inputs, shows no effect for it. Checking the correlations between the numeric fields of this transaction data set before modelling helps avoid drawing such misleading conclusions.
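The pairwise correlations behind the heatmap can also be inspected as plain numbers. A sketch in base R (it re-reads the raw file, an assumption made so the industry and location codes are still numeric rather than the character versions created during data preparation):

```r
# numeric correlation matrix underlying the heatmap
raw <- read_csv(here("dataset/transactions.csv"))       # raw numeric codes
num_cols <- raw[sapply(raw, is.numeric)]                # keep numeric columns
round(cor(num_cols, use = "pairwise.complete.obs"), 2)  # correlation matrix
```

Any off-diagonal value near 1 or -1 flags a collinear pair that may need attention before fitting the regression model.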
Medeiros, L. (2021, December 19). The CRISP-DM methodology. Medium. https://medium.com/@lucas.medeiross/the-crisp-dm-methodology-d1b1fc2dc653
Sharma, A. (2020, December 23). What is Skewness in Statistics? | Statistics for Data Science. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/07/what-is-skewness-statistics/
Wijaya, C. Y. (2021, December 19). CRISP-DM Methodology For Your First Data Science Project. Medium. https://towardsdatascience.com/crisp-dm-methodology-for-your-first-data-science-project-769f35e0346c
## load data description csv file
dd <- read_csv(here("dataset/data_description.csv"))
# display the information under table format
dd %>%
kbl() %>%
kable_styling(full_width = F)
| Field | Data Type | Description |
|---|---|---|
| date | Date | Date of the first day of each month |
| customer_id | String | Unique customer identifier |
| industry | Integer | Code for 10 industries, ranging from 1 to 10 |
| location | Integer | Code for 10 locations, ranging from 1 to 10 |
| monthly_amount | Numeric | Total transaction amount for a customer in a given month |